The idea in this notebook is to reduce the dimensionality of the datasets by transforming each individual feature with a classifier. Once that is done, the subject-specific datasets can be combined into a single global dataset. This runs some risk of overfitting, but it is also a nice way to create a global classifier.
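A minimal sketch of that per-subject, per-feature transform (the code further down actually takes a simpler shortcut, using the scaled per-segment mean; subj_X and subj_y are placeholder names for one subject's feature matrix and labels):
import numpy as np
import sklearn.ensemble
import sklearn.cross_validation

def transform_feature(subj_X, subj_y):
    # returns one score per segment for a single feature and subject
    forest = sklearn.ensemble.RandomForestClassifier(n_estimators=100)
    scores = np.zeros(len(subj_y))
    cv = sklearn.cross_validation.StratifiedKFold(subj_y, n_folds=3)
    for tr, te in cv:
        forest.fit(subj_X[tr], subj_y[tr])
        # out-of-fold preictal probability becomes the transformed feature
        scores[te] = forest.predict_proba(subj_X[te])[:, 1]
    return scores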
Same initialisation steps as in other notebooks:
In [1]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.rcParams['figure.figsize'] = 6, 4.5
plt.rcParams['axes.grid'] = True
plt.gray()
In [2]:
cd ..
In [3]:
import train
import json
import imp
In [4]:
settings = json.load(open('SETTINGS.json', 'r'))
In [5]:
data = train.get_data(settings['FEATURES'][:3])
In [6]:
!free -m
For each feature and each subject we want to train a random forest and use it to transform the data, weighting the samples appropriately because the classes are unbalanced.
Since I'm a big fan of dictionaries, it seems easiest to iterate over subjects and features and save the resulting predictions in a dictionary keyed by segment, along the lines of the sketch below.
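The target layout is a nested dictionary keyed by segment name, with one short list per feature plus a subject indicator (the segment key, feature names and vector length here are illustrative only):
predictiondict = {
    'Dog_1_preictal_segment_0001.mat': {
        'feature_a': [0.73],   # one transformed value per feature
        'feature_b': [0.12],
        'subject': [1, 0, 0],  # 1-of-k indicator, one entry per subject
    },
    # ... one entry per segment
}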
In [7]:
import sklearn.preprocessing
import sklearn.pipeline
import sklearn.ensemble
import sklearn.cross_validation
import sklearn.metrics
from train import utils
In [8]:
imp.reload(utils)
Out[8]:
The code below is copied and modified from the random forest submission notes:
In [221]:
features = settings['FEATURES'][:3]
In [10]:
subjects = settings['SUBJECTS']
In [11]:
scaler = sklearn.preprocessing.StandardScaler()
forest = sklearn.ensemble.RandomForestClassifier()
model = sklearn.pipeline.Pipeline([('scl',scaler),('clf',forest)])
In [166]:
import sklearn.feature_extraction
In [179]:
oneofk = sklearn.preprocessing.OneHotEncoder(sparse=False)
In [180]:
x = np.arange(10)[:, np.newaxis]  # one categorical value per row
In [181]:
oneofk.fit_transform(x)
Out[181]:
In [240]:
%%time
predictiondict = {}
for feature in features:
    print("Processing {0}".format(feature))
    for i,subj in enumerate(subjects):
        # training step: build this subject's matrix for a single feature
        X,y,cv,segments = utils.build_training(subj,[feature],data)
        X = scaler.fit_transform(X)
        # collapse the scaled feature matrix to one value per segment
        predictions = np.mean(X,axis=1)
        for segment,prediction in zip(segments,predictions):
            try:
                predictiondict[segment][feature] = [prediction]
            except KeyError:
                predictiondict[segment] = {}
                predictiondict[segment][feature] = [prediction]
            # add subject 1-of-k vector
            subjvector = np.zeros(len(subjects))
            subjvector[i] = 1
            predictiondict[segment]['subject'] = list(subjvector)
Next, creating the full training set:
In [241]:
segments = list(predictiondict.keys())
In [242]:
predictiondict[segments[0]].keys()
Out[242]:
In [243]:
X = np.array([])[np.newaxis]
for i,segment in enumerate(segments):
    row = []
    for feature in features+['subject']:
        row += predictiondict[segment][feature]
    try:
        X = np.vstack([X,np.array(row)[np.newaxis]])
    except ValueError:
        # first row: can't stack onto the empty placeholder array
        X = np.array(row)[np.newaxis]
In [244]:
X
Out[244]:
In [245]:
y = [1 if 'preictal' in segment else 0 for segment in segments]
In [246]:
y = np.array(y)
In [247]:
len(y)
Out[247]:
In [248]:
len(X)
Out[248]:
In [249]:
len(segments)
Out[249]:
In [250]:
cv = sklearn.cross_validation.StratifiedShuffleSplit(y)
In [255]:
weight = len(y)/sum(y)  # ratio of total segments to preictal segments, used to up-weight the rare class
In [256]:
weights = [weight if i == 1 else 1 for i in y]
In [258]:
for train_idx,test_idx in cv:
    # use the per-sample weights, subset to the training fold
    forest.fit(X[train_idx],y[train_idx],sample_weight=np.asarray(weights)[train_idx])
    predictions = forest.predict_proba(X[test_idx])
    score = sklearn.metrics.roc_auc_score(y[test_idx],predictions[:,1])
    print(score)
In [261]:
forest.fit(X,y,sample_weight=weights)
Out[261]:
In [264]:
predictiondict = {}
for feature in features:
    print("Processing {0}".format(feature))
    for i,subj in enumerate(subjects):
        X,segments = utils.build_test(subj,[feature],data)
        X = scaler.fit_transform(X)
        # same per-segment summary as used for training
        predictions = np.mean(X,axis=1)
        for segment,prediction in zip(segments,predictions):
            try:
                predictiondict[segment][feature] = [prediction]
            except KeyError:
                predictiondict[segment] = {}
                predictiondict[segment][feature] = [prediction]
            # add subject 1-of-k vector
            subjvector = np.zeros(len(subjects))
            subjvector[i] = 1
            predictiondict[segment]['subject'] = list(subjvector)
In [265]:
segments = list(predictiondict.keys())
In [266]:
X = np.array([])[np.newaxis]
for i,segment in enumerate(segments):
    row = []
    for feature in features+['subject']:
        row += predictiondict[segment][feature]
    try:
        X = np.vstack([X,np.array(row)[np.newaxis]])
    except ValueError:
        # first row: can't stack onto the empty placeholder array
        X = np.array(row)[np.newaxis]
In [267]:
import csv
In [268]:
predictiondict = {}
for segment,fvector in zip(segments,X):
    # predict_proba expects a 2D array, so pass a single-row matrix
    predictiondict[segment] = forest.predict_proba(fvector[np.newaxis])
In [269]:
with open("output/protosubmission.csv","w") as f:
    c = csv.writer(f)
    c.writerow(['clip','preictal'])
    for seg in predictiondict.keys():
        # last column of predict_proba is the preictal probability
        c.writerow([seg,"%s"%predictiondict[seg][-1][-1]])
In [270]:
!head output/protosubmission.csv
In [271]:
!wc -l output/protosubmission.csv
In [272]:
!wc -l output/sampleSubmission.csv
The submission has the wrong number of rows, but I submitted it anyway and it scored 0.53141. After adding the 1-of-k encoded subject vectors and the sample weightings it scored 0.56016. That improvement should hold when I add more features, or when I do the above in a smarter way.
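To track down the length mismatch, a quick check is to diff the clip names against the sample submission (a sketch; it assumes the first column of both CSVs holds the clip name under a 'clip' header):
import csv

def clip_names(path):
    # set of clip names in a submission-style CSV, header excluded
    with open(path) as f:
        return {row[0] for row in csv.reader(f)} - {'clip'}

missing = (clip_names('output/sampleSubmission.csv')
           - clip_names('output/protosubmission.csv'))
print(len(missing))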